Abstract: In this paper, we seek to establish asymptotic results for
selective inference procedures without the assumption of Gaussianity. The
class of selection procedures we consider is determined by affine
inequalities; we refer to these as affine selection procedures. Examples of
affine selection procedures include selective inference along the solution
path of the LASSO, as well as selective inference after fitting the LASSO at a
fixed value of the regularization parameter. We also consider some tests in
penalized generalized linear models. Our result proves asymptotic convergence
in the high-dimensional setting where n < p, and n can be of a logarithmic
factor of the dimension p for some procedures. Our method of proof adapts
a method of Chatterjee (2005).

AMS 2000 subject classifications: Primary 62M40; secondary 62J05. 
Keywords and phrases: selective inference, non-Gaussian error,
high-dimensional inference, LASSO.


1. Introduction 

Selective inference is a recent research topic that studies valid inference
after a statistical model is suggested by the data (Fithian et al. 2014,
Lee et al. 2013, Taylor et al. 2014, 2013). Classical inference tools break
down at this point because the data used for the hypothesis test is allowed to
be the data used to suggest the hypothesis. Specifically, instead of being
given a priori, the hypothesis to test depends on the data and is thus random.
Formally, we denote by $\mathcal{E}^* = \mathcal{E}^*(y, X)$ the model selection
procedure, which generates a set of hypotheses to test, or perhaps parameters
for which to form intervals. It is useful to think of $\mathcal{E}^*$ as a
point process with values in $\mathcal{S}$, where $\mathcal{S}$ is some
collection of questions of possible interest. Consider the following example.

Suppose $y|X \sim G$ with $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n\times p}$
fixed. For any $E \subset \{1,\dots,p\}$ define the functionals

$$\beta_{j,E}(G) = e_j^T \operatorname{argmin}_{\beta_E} \mathbb{E}_G\left(\|y - X_E\beta_E\|_2^2 \mid X\right), \quad j \in E,$$

where $e_j$ is the unit vector with only the j-th entry being 1. Such
functionals are essentially the best linear coefficients within the model
consisting of only the variables in $E$. The collection of possibly
interesting questions then consists of the functionals $\beta_{j,E}$ over all
subsets $E \subset \{1,\dots,p\}$ and all $j \in E$.

The data $(y,X)$ will then suggest a subset of interesting variables $E$, and
$\mathcal{E}^*(y,X)$ designates the target for inference to be
$\{\beta_{j,E}, j \in E\}$, the best linear coefficients within a model
consisting of only the variables in $E$.

Previous literature has studied inference after different model selection
procedures $\mathcal{E}^*$. Notably, Lee et al. (2013) proposed an exact test
within the model suggested by the LASSO, that is,
$\mathcal{E}^* = \{\beta_{j,E}, j \in E\}$, where $E$ is the active set of
the LASSO solution. The test is based upon a pivotal quantity which the
authors prove to be distributed as Unif(0,1) if the hypothesis to be tested is
true. Thus such a quantity $P_j(y)$ can be used to test the hypothesis
$H_{0j}: \beta_{j,E} = 0$ and control the “Type-I error” at level $\alpha$,

$$P\left(P_j(y) \le \alpha \mid H_{0j} \text{ is true}\right) \le \alpha. \tag{1}$$

By inverting such tests, Lee et al. (2013) can also construct valid confidence
intervals for $\beta_{j,E}$.

It is of course worth noticing that both the hypothesis $H_{0j}$ and the
parameters $\beta_{j,E}$ are random, as $E$ is suggested by the data. So the
“Type-I error” (1) is not the classical Type-I error, where the hypotheses are
given a priori. This inference framework was first considered in Berk et al.
(2013), and we leave the philosophical discussion of the approach to
Fithian et al. (2014).

The means by which Lee et al. (2013) control the “Type-I error” is the
construction of the p-value functions $P_j$. This construction depends heavily
on the assumption of normality of the error distribution. Other works such as
Lockhart et al. (2013) and Taylor et al. (2014) used similar approaches.
Compared to these previous works, we seek to remove the Gaussian assumption on
the errors and establish asymptotic distributions of $P_j$. We state the
conditions under which $P_j$ will be asymptotically distributed as Unif(0,1),
so that $P_j$ can be used as p-values to test the hypotheses and
asymptotically control the “Type-I error” in (1). This allows asymptotically
valid inference in the linear regression setting without normality
assumptions. It also allows application of the covariance test
(Lockhart et al. 2013) in generalized linear models.

1.1. Related works 

Tibshirani et al. (2015) also consider uniform convergence of the statistics
proposed by Taylor et al. (2014), but focus mainly on the low-dimensional
case. In the high-dimensional case, they have a negative result on the uniform
convergence of the pivot. In this paper, we instead focus on the
high-dimensional case and state the conditions under which the pivot will
converge. More specifically, n is allowed to be of a logarithmic factor of the
dimension p for two common procedures introduced in Section 4.

In the works of Belloni et al. (2012), Meinshausen et al. (2012),
Zhang & Zhang (2014) and Javanmard & Montanari (2015), the authors proposed
various ways of constructing confidence regions for the underlying parameters
in the high-dimensional setting. One major difference between these works and
our framework is that they try to achieve full model inference without using
the data to choose a hypothesis. The advantage of such an approach is
robustness. But in the high-dimensional setting, with tens of thousands of
potential variables, it is natural to use the data to select hypotheses of
interest and perform valid inference only for those hypotheses. In addition,
some of the full model inference works require a linear underlying model
(Meinshausen et al. 2012, Javanmard & Montanari 2015), which the framework of
selective inference does not require. For more philosophical discussion of the
comparison between the two approaches, see Fithian et al. (2014).

1.2. Organization of the paper 

In Section 2, we formally introduce the methods for selective inference with
certain model selection procedures, which we call affine selection procedures.
In Section 3, we state the main theorem that will allow asymptotically valid
inference. In Section 4, we illustrate the applications of our results to two
selective inference problems: selective inference after solving the LASSO at a
fixed $\lambda$, and the covariance test for testing the global null in
generalized linear models. We collect all the proofs in Section 5 and discuss
directions for future research in Section 6.

2. Selective inference with affine selection procedures

Suppose we have a design matrix $X \in \mathbb{R}^{n\times p}$, considered
fixed, and

$$y_i | x_i \sim G(\mu(x_i), \sigma^2(x_i)), \tag{2}$$

where $x_i$ is the i-th row of the matrix $X$ and $G(\mu, \sigma^2)$ denotes
any one-dimensional distribution with mean $\mu$ and variance $\sigma^2$. We
also denote $\mu(X) = (\mu(x_1),\dots,\mu(x_n))$ and
$\Sigma(X) = \mathrm{diag}(\sigma^2(x_1),\dots,\sigma^2(x_n))$, a diagonal
matrix with $\sigma^2(x_i)$ as the diagonal entries. Some feature selection
procedure is then applied to the data to select a subset
$E \subset \{1,2,\dots,p\}$, and the target of inference will be
$\mathcal{E}^*(y, X) = \{\beta_{j,E}, j \in E\}$. In general, we consider
certain selection procedures called affine selection procedures.

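Definition 1 (Affine selection procedure). Suppose a model selection procedure
$\mathcal{E}^*: \mathbb{R}^n \times \mathbb{R}^{n\times p} \to \mathcal{S}$,
where $\mathcal{S}$ is a finite set of models,

$$\mathcal{S} = \{\mathcal{E}_1, \dots, \mathcal{E}_{|\mathcal{S}|}\}.$$

We call $\mathcal{E}^*$ an affine selection procedure if the selection event
can be written as an affine set in the first argument of $\mathcal{E}^*$.
Formally, $\mathcal{E}^*$ is an affine selection procedure if for each
potential model to be selected $\mathcal{E} \in \mathcal{S}$,

$$\{\mathcal{E}^*(z,X) = \mathcal{E}\} = \{A(\mathcal{E},X)z \le b(\mathcal{E},X)\}, \quad (z, X) \in \mathbb{R}^n \times \mathbb{R}^{n\times p}, \tag{3}$$

where $A \in \mathbb{R}^{k\times n}$, $b \in \mathbb{R}^k$ and
$k \in \mathbb{N}$ depend only on $\mathcal{E}$ and $X$. Moreover, the sets

$$\{A(\mathcal{E}_i, X)z \le b(\mathcal{E}_i, X)\} \subset \mathbb{R}^n, \quad i = 1,\dots,|\mathcal{S}|,$$

are disjoint or their intersections have measure 0 under the Lebesgue measure
on $\mathbb{R}^n$.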

Examples of affine selection procedures include selection procedures that are
based on $E$, the set of variables chosen by the data, and usually some other
information^1. Various algorithms can be used to select $E$, e.g. $E$ as the
active set of the LASSO solution at a fixed $\lambda$ (Lee et al. 2013), $E$
as the first variable to enter the LASSO or LARS path (Lockhart et al. 2013)
(more generally, of any $\ell_1$-penalized generalized linear model), or $E$
as the $k$ variables included at the $k$-th step of forward stepwise selection
(Taylor et al. 2014).

The works of Lee et al. (2013), Lockhart et al. (2013), Taylor et al. (2014, 
2013) have constructed valid p-values when the family G is the Gaussian family. 
Formally, the pivotal function depends on the following quantities. 


2.1. Notations 


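The pivotal function is determined by the following functions. For any
$A \in \mathbb{R}^{k\times n}$, $b \in \mathbb{R}^k$,
$\Sigma \in \mathbb{R}^{n\times n}$, $\eta \in \mathbb{R}^n$,

$$\alpha = \alpha(A, b, \Sigma, \eta) = \frac{A\Sigma\eta}{\eta^T\Sigma\eta}, \tag{4}$$

$$L(z; A, b, \Sigma, \eta) = \max_{\alpha_j < 0} \frac{b_j - (Az)_j + \alpha_j\eta^Tz}{\alpha_j}, \tag{5}$$

$$U(z; A, b, \Sigma, \eta) = \min_{\alpha_j > 0} \frac{b_j - (Az)_j + \alpha_j\eta^Tz}{\alpha_j}. \tag{6}$$

Furthermore, we define

$$F(x; \sigma^2, m, a, b) = \frac{\Phi((x - m)/\sigma) - \Phi((a - m)/\sigma)}{\Phi((b - m)/\sigma) - \Phi((a - m)/\sigma)},$$

which is the CDF of the univariate Gaussian law $N(m,\sigma^2)$ truncated to
the interval $[a, b]$.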
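As a concrete illustration of these definitions, the following is a minimal Python sketch (standard library only) of $\alpha$, $L$, $U$ and $F$. The function names and the toy constraint set are assumptions made for this example and do not come from the paper; rows with $\alpha_j = 0$ are simply skipped.

```python
import math


def norm_cdf(x):
    """Standard normal CDF; math.erf handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_gaussian_cdf(x, sigma2, m, a, b):
    """F(x; sigma^2, m, a, b): CDF of N(m, sigma^2) truncated to [a, b]."""
    s = math.sqrt(sigma2)
    num = norm_cdf((x - m) / s) - norm_cdf((a - m) / s)
    den = norm_cdf((b - m) / s) - norm_cdf((a - m) / s)
    return num / den


def truncation_limits(z, A, b, Sigma, eta):
    """Return (L, U) for the affine constraints A z <= b.

    A and Sigma are lists of rows; z, b and eta are plain lists.
    """
    n = len(z)
    Sigma_eta = [sum(Sigma[i][j] * eta[j] for j in range(n)) for i in range(n)]
    eta_Sigma_eta = sum(eta[i] * Sigma_eta[i] for i in range(n))
    eta_z = sum(eta[i] * z[i] for i in range(n))
    L, U = -math.inf, math.inf
    for row, b_j in zip(A, b):
        alpha_j = sum(row[i] * Sigma_eta[i] for i in range(n)) / eta_Sigma_eta
        if alpha_j == 0:
            continue  # this row does not restrict eta^T z
        Az_j = sum(row[i] * z[i] for i in range(n))
        bound = (b_j - Az_j + alpha_j * eta_z) / alpha_j
        if alpha_j < 0:
            L = max(L, bound)
        else:
            U = min(U, bound)
    return L, U


# Toy example (hypothetical): constraints z_1 <= 2 and -z_1 <= 1, eta = e_1.
A = [[1.0, 0.0], [-1.0, 0.0]]
b = [2.0, 1.0]
Sigma = [[1.0, 0.0], [0.0, 1.0]]
eta = [1.0, 0.0]
z = [0.5, 3.0]

L, U = truncation_limits(z, A, b, Sigma, eta)
pivot = truncated_gaussian_cdf(z[0], 1.0, 0.0, L, U)
```

In this toy case the constraints only involve $z_1 = \eta^T z$, so the limits are the endpoints of the interval $[-1, 2]$ regardless of $z_2$.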


2.2. A pivotal quantity with Gaussian errors 

Theorem 1 provides the construction of a pivotal function when the data is
normally distributed and $\mathcal{E}^*$ is an affine selection procedure. We
denote the response variables by $\mathcal{Y}$ when $G$ is the Gaussian
family, to distinguish them from $y$, where $G$ is a more general
location-scale family. Note that all distributions in this paper are
conditional on $X$; that is, the laws we consider are either
$\mathcal{L}(\mathcal{Y}|X)$ or $\mathcal{L}(y|X)$. All random variables have
access to $X$ as if it were a constant.

^1 See Section 4 for details.

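Theorem 1 (Lee et al. (2013)). Suppose $X \in \mathbb{R}^{n\times p}$ and
$\mathcal{Y} \sim N(\mu(X), \Sigma(X))$, $\mu(X) \in \mathbb{R}^n$,
$\Sigma(X) \in \mathbb{R}^{n\times n}$, and $\mathcal{E}^*$ is an affine
selection procedure on $\mathbb{R}^n \times \mathbb{R}^{n\times p}$. Then for
any $\eta: \mathbb{R}^n \to \mathbb{R}^n$ measurable with respect to
$\sigma(\mathcal{E}^*)$ we have

$$F\left(\eta(\mathcal{E})^T\mathcal{Y};\ \eta(\mathcal{E})^T\Sigma\eta(\mathcal{E}),\ \eta(\mathcal{E})^T\mu,\ L_{\mathcal{E}}(\mathcal{Y}),\ U_{\mathcal{E}}(\mathcal{Y})\right) \Big|\ \mathcal{E}^*(\mathcal{Y}, X) = \mathcal{E}\ \sim\ \mathrm{Unif}(0,1), \tag{7}$$

where $\mathcal{E}^*(z, X) = \mathcal{E} \iff A(\mathcal{E},X)z \le b(\mathcal{E},X)$ and

$$L_{\mathcal{E}}(z) = L(z; A(\mathcal{E},X), b(\mathcal{E},X), \Sigma, \eta(\mathcal{E})), \tag{8}$$

$$U_{\mathcal{E}}(z) = U(z; A(\mathcal{E},X), b(\mathcal{E},X), \Sigma, \eta(\mathcal{E})). \tag{9}$$

Moreover, marginalizing over the selection procedure $\mathcal{E}^*$, we have
the following:

$$F\left(\eta(\mathcal{E}^*)^T\mathcal{Y};\ \eta(\mathcal{E}^*)^T\Sigma\eta(\mathcal{E}^*),\ \eta(\mathcal{E}^*)^T\mu,\ L_{\mathcal{E}^*}(\mathcal{Y}),\ U_{\mathcal{E}^*}(\mathcal{Y})\right) \sim \mathrm{Unif}(0,1). \tag{10}$$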

The significance of Theorem 1 is that, assuming the diagonal matrix $\Sigma$
is known, the only unknown parameter for the pivotal quantity (10) is
$\eta^T\mu$. To test the hypothesis $H_0: \eta^T\mu = 0$, we just need to plug
in the value and then compute (10), which can then be used as a p-value to
accept or reject the hypothesis. For example, if we take

$$\eta = X_E(X_E^TX_E)^{-1}e_j, \tag{11}$$

where $e_j$ is the unit vector with only the j-th entry being 1, then
$\eta^T\mu = \beta_{j,E}$. The quantity in (10) is pivotal and can be used to
test the hypothesis $H_{0j}: \beta_{j,E} = 0$ and control the “Type-I error”
(1). Since $X$ is fixed, we use the shorthand

$$\mathcal{E}^*(z) = \mathcal{E}^*(z,X), \quad A(\mathcal{E}) = A(\mathcal{E},X), \quad b(\mathcal{E}) = b(\mathcal{E},X).$$


3. Asymptotics with non-Gaussian error 

Now if we remove the assumption that $\mathcal{L}(y|X) = N(\mu(X), \Sigma)$,
the conclusion of Theorem 1 no longer holds. The best we can hope for is a
weak convergence result: that the same pivotal quantity (10) converges to
Unif(0,1) as $n \to \infty$. This requires some conditions on both the
distribution $\mathcal{L}(y|X)$ and the selection procedure $\mathcal{E}^*$.
Our main contribution in this work, Theorem 3, establishes conditions on
$\mathcal{L}(y|X)$ and $\mathcal{E}^*$ under which the pivotal quantity (10)
is asymptotically distributed as Unif(0,1).

The main approach is to compare the distribution of the pivot (10) under the
distribution $\mathcal{L}(y|X)$ with that under the Gaussian distribution
$\mathcal{L}(\mathcal{Y}|X)$. In the latter case, the exact distribution is
derived in Theorem 1. In the following, we establish the conditions under
which the above two distributions are comparable.
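The kind of comparison described here can be explored numerically. The sketch below is a hypothetical toy setup, not an experiment from the paper: the errors are centred exponential (mean 0, variance 1, skewed), $\eta = n^{-1/2}(1,\dots,1)$, and the selection event is the single affine constraint $-\eta^T y \le 0$, so that $L = 0$, $U = +\infty$ and $\eta^T\Sigma\eta = 1$. If the Gaussian pivot is approximately uniform under these non-Gaussian errors, its sample mean should be close to 1/2.

```python
import math
import random


def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def one_sided_pivot(t):
    """F(t; 1, 0, 0, +inf): CDF of N(0, 1) truncated to [0, inf), at t >= 0."""
    return (norm_cdf(t) - 0.5) / 0.5


def simulate_pivots(n, reps, seed=0):
    """Pivots from repeated draws that satisfy the selection event."""
    rng = random.Random(seed)
    pivots = []
    for _ in range(reps):
        # Centred exponential errors: non-Gaussian, mean 0, variance 1.
        y = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        t = sum(y) / math.sqrt(n)  # eta^T y with eta = n^{-1/2} * ones
        if t >= 0:                 # selection event: -eta^T y <= 0
            pivots.append(one_sided_pivot(t))
    return pivots


pivots = simulate_pivots(n=200, reps=2000)
mean_pivot = sum(pivots) / len(pivots)
```

This only checks one moment of the pivot in a case where the influence of each $y_i$ is exactly $n^{-1/2}$; it is meant as a sanity check, not as evidence for the general theorem.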
3.1. Bounding the influence function


Note the pivotal quantity in (10) depends on $y$ either through the linear
function $\eta^Ty$ or through the maximum/minimum of linear functions
$L_{\mathcal{E}^*}(y)$, $U_{\mathcal{E}^*}(y)$. In approximating the exact
Gaussian theory with asymptotic results, a quantity analogous to a Lipschitz
constant (in $y$) will be necessary, expressing the changes in $\eta^Ty$ as
well as in the lower and upper bounds $L_{\mathcal{E}^*}$ and
$U_{\mathcal{E}^*}$. This, in some sense, describes the influence each $y_i$
can have on the pivotal quantity (10).

For an affine selection procedure
$\mathcal{E}^*: \mathbb{R}^n \times \mathbb{R}^{n\times p} \to \mathcal{S}$,
without loss of generality suppose $\mathcal{E}^*$ is surjective. Since
$\mathcal{E}^*$ is affine, for any model $\mathcal{E} \in \mathcal{S}$ there
are the associated $A(\mathcal{E})$ and $b(\mathcal{E})$ as defined in (3). We
define


M{£,g) 


max 

l<2<nrow(A(£^)) 

l<//<n 


A[£),, 

{A{£)/:g{£)), 




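$$M(\mathcal{E}^*, \eta) = \max_{\mathcal{E} \in \mathcal{S}} M(\mathcal{E}, \eta). \tag{12}$$

We also define

$$r(\mathcal{E}) = \mathrm{nrow}(A(\mathcal{E})), \qquad r(\mathcal{E}^*) = \max_{\mathcal{E} \in \mathcal{S}} \mathrm{nrow}(A(\mathcal{E})). \tag{13}$$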


The quantity $M(\mathcal{E}^*, \eta)$ measures the maximal influence any $y_i$
has on a smoothed version of the triple
$(\eta(\mathcal{E}^*)^Ty, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y))$. As
$M(\mathcal{E}^*, \eta)$ and $r(\mathcal{E}^*)$ are critical in bounding the
difference between $\mathcal{L}(y|X)$ and $\mathcal{L}(\mathcal{Y}|X)$, it is
important to get a sense of their size. Typically $r(\mathcal{E}^*)$ is less
than $p$, and we discuss the typical size of $M(\mathcal{E}^*, \eta)$ through
the following simple example:

Example 3.1. Suppose the design matrix $X \in \mathbb{R}^{n\times p}$ is
generated in the following way: we first generate each row independently from
a distribution on $\mathbb{R}^p$ and then normalize the columns of $X$ to have
length 1. Suppose that instead of using the data to select a model, we just
arbitrarily choose a subset $E$. This is equivalent to no selection at all,
thus $M(\mathcal{E}, \eta) = \|\eta(\mathcal{E})\|_\infty$. If we want to
perform inference for $\beta_{j,E}$, we take

$$\eta = X_E(X_E^TX_E)^{-1}e_j,$$

where $e_j$ is the unit vector with only the j-th coordinate being 1. Since we
normalize the columns, it is not hard to verify $(X_E^TX_E)^{-1} = O_p(1)$ and
$\max_{ij}|X_{ij}| = O_p(n^{-1/2})$; thus if the selected variable set always
satisfies $|E| \ll n$, then $\|\eta\|_\infty = O_p(n^{-1/2})$. Therefore
$M(\mathcal{E}^*, \eta) = O_p(n^{-1/2})$.
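The $n^{-1/2}$ scaling in this example can be checked numerically in the simplest case $|E| = 1$, where $X_E^TX_E = 1$ after normalization and $\eta$ is just the normalized column. The sketch below assumes, purely for illustration, that the raw entries are i.i.d. uniform on $[-1, 1]$; the function name is hypothetical.

```python
import math
import random


def sup_norm_eta(n, seed=0):
    """||eta||_inf for E = {j} with a single normalised column of length n."""
    rng = random.Random(seed)
    col = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in col))
    x = [v / norm for v in col]  # column normalised to length 1
    # With |E| = 1, X_E^T X_E = 1, so eta = X_E (X_E^T X_E)^{-1} e_j = x.
    return max(abs(v) for v in x)


# sqrt(n) * ||eta||_inf should stay bounded as n grows.
scaled = {n: math.sqrt(n) * sup_norm_eta(n) for n in (100, 400, 1600)}
```

Under this assumption the rescaled values stay of order one (roughly $\sqrt{3}$ for large $n$), which is the behaviour the example describes.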

This is a very simple example which does not involve selection. In reality we
will use some meaningful selection procedure that uses the data, so
$M(\mathcal{E}^*, \eta)$ would involve $A(\mathcal{E}^*)$ and
$b(\mathcal{E}^*)$ as well. However, we will see through examples in Section 4
that it is still reasonable to assume $M(\mathcal{E}^*, \eta) = O(n^{-1/2})$.

The following theorem compares the distribution of
$(\eta(\mathcal{E}^*)^Ty, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y))$ under
$\mathcal{L}(y|X)$ with its Gaussian counterpart.

Theorem 2. Fix $X \in \mathbb{R}^{n\times p}$. Suppose $(y, \mathcal{Y})$ are
defined conditionally independent given $X$ on a common probability space such
that

• $\mathcal{L}(y|X)$ has independent entries with mean vector $\mu$,
covariance matrix $\Sigma$ and finite third moments bounded by $\gamma$;

• $\mathcal{L}(\mathcal{Y}|X) = N(\mu, \Sigma)$.

Suppose we are given $\eta \in \sigma(\mathcal{E}^*)$. Then, given any bounded
function $W \in C^3(\mathbb{R}^3; \mathbb{R})$ with bounded derivatives
satisfying

$$W(u, v, w) \begin{cases} > 0 & \text{if } v < u < w, \\ = 0 & \text{else,} \end{cases}$$

there exists $N = N(M(\mathcal{E}^*, \eta), |\mathcal{S}|, r(\mathcal{E}^*), W)$
such that the following holds for $n, p > N$,

[p{£*fy,Ls.{y),Us.{y)] 

<C{Wn) 


EtF [v{£*Yy,Ls4y),Ue.{y)\ 

\4 

log(r(f*)|5|)j nM{£\yf 


(14) 


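where $C(W, \gamma)$ is a constant depending only on the derivatives of $W$
and on $\gamma$, and $\eta(\mathcal{E}^*)$ is $\eta(\mathcal{E}^*(y))$ or
$\eta(\mathcal{E}^*(\mathcal{Y}))$ depending on the context.

As it is reasonable to assume $M(\mathcal{E}^*, \eta) = O(n^{-1/2})$, it is
reasonable to assume the RHS of (14) goes to zero. Thus the distribution of
$(\eta(\mathcal{E}^*)^Ty, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y))$ is
close to that of
$(\eta(\mathcal{E}^*)^T\mathcal{Y}, L_{\mathcal{E}^*}(\mathcal{Y}), U_{\mathcal{E}^*}(\mathcal{Y}))$.
In the following, we discuss the conditions under which the pivotal quantity
(10) converges.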


3.2. Smoothness of the pivot 

Note the bound in (14) also depends on $C(W, \gamma)$, that is, on the
derivatives of $W$. Thus besides the influence of each $y_i$ on (10), it is
also necessary to control the smoothness of (10). In particular, the pivot in
(10) takes the form of a truncated Gaussian CDF. Moreover, the smoothness
(derivatives) of the truncated Gaussian CDF $F(x; \sigma^2, m, a, b)$ can
depend heavily on the truncation interval $[a, b]$. More specifically, a lower
bound on the denominator of $F(x; \sigma^2, m, a, b)$ puts some constraints on
the width of the interval $[a, b]$ as well as its distance to the origin. In
our context, $a$ and $b$ correspond to the lower and upper bounds appearing in
(10). Formally, we make the following assumption:

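Assumption 1. Suppose we have $X_n \in \mathbb{R}^{n\times p_n}$ and
$y_n \in \mathbb{R}^n$ is generated according to (2), and $\mathcal{Y}_n$ is
generated independently (conditional on $X_n$) from
$N(\mu(X_n), \Sigma(X_n))$, a Gaussian distribution with the same means and
variances. We also have affine selection procedures
$\mathcal{E}^* = \mathcal{E}_n^*$. We assume there exists $\delta_n \to 0$
such that

$$
\begin{aligned}
P\left(U_{\mathcal{E}^*}(y_n) - L_{\mathcal{E}^*}(y_n) < \delta_n\right) &\to 0, \\
P\left(U_{\mathcal{E}^*}(\mathcal{Y}_n) - L_{\mathcal{E}^*}(\mathcal{Y}_n) < \delta_n\right) &\to 0, \\
P\left(\min(|U_{\mathcal{E}^*}(y_n)|, |L_{\mathcal{E}^*}(y_n)|) > 1/\delta_n\right) &\to 0, \\
P\left(\min(|U_{\mathcal{E}^*}(\mathcal{Y}_n)|, |L_{\mathcal{E}^*}(\mathcal{Y}_n)|) > 1/\delta_n\right) &\to 0.
\end{aligned}
\tag{15}
$$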
The first two conditions in (15) put a lower bound on the width of the
truncation interval $(L_{\mathcal{E}^*}(y_n), U_{\mathcal{E}^*}(y_n))$. The
last two conditions make sure the truncation does not occur too far from the
origin, so that we have reasonable behavior in the tail. $\delta_n$ is the
rate at which the truncation interval shrinks (or at which its distance to the
origin grows). This rate will appear in the RHS of (14), and thus we impose a
condition on
$(\delta_n, M(\mathcal{E}_n^*, \eta_n), r(\mathcal{E}_n^*), |\mathcal{S}_n|)$
to ensure the convergence of the pivot (10).

3.3. Main result 

Suppose we have $X_n \in \mathbb{R}^{n\times p_n}$ and $y_n \in \mathbb{R}^n$
is generated according to (2). We denote its distribution by
$\mathcal{L}(y_n|X_n)$. The convergence mentioned below is under this sequence
of distributions $\{\mathcal{L}(y_n|X_n)\}_{n=1}^{\infty}$.

Theorem 3 (Convergence of the pivot). Suppose we have a sequence of $y_n$
generated as above, with means $\mu_n = \mu(X_n)$, variances
$\Sigma_n = \Sigma(X_n)$ and finite third moments. We also assume Assumption 1
is satisfied with a sequence $\delta_n$. Furthermore, let $\mathcal{E}_n^*$ be
a sequence of affine selection procedures, $\eta_n = \eta(\mathcal{E}_n^*)$,
and let the corresponding $M(\mathcal{E}_n^*, \eta_n)$, $r(\mathcal{E}_n^*)$
and $\mathcal{S}_n$ be defined as in Section 3.1. Then if

l/5l-M{£*^,ynf-n 
we have 

-4 Unif(0,1), n^oo, (16) 

where $P(x; \sigma^2, m, a, b) = 2\min\left(F(x; \sigma^2, m, a, b),\ 1 - F(x; \sigma^2, m, a, b)\right)$
is the two-sided pivot.

In the following section, we apply Theorem 3 to different selection procedures. 

4. Examples 

We give two examples in this section as applications of Theorem 3. The first
example is selective inference after solving the LASSO, and the second is a
test of the global null in generalized linear models. In these two examples,
we explain why the selection procedure is affine, what the data distribution
$\mathcal{L}(y_n|X_n)$ is, and what the quantities
$(\delta_n, M(\mathcal{E}_n^*, \eta_n), r(\mathcal{E}_n^*), |\mathcal{S}_n|)$
are. To ease the notation, we suppress the dependence on $n$ whenever
possible. It is helpful to keep in mind that $y_n \in \mathbb{R}^n$ and
$X_n \in \mathbb{R}^{n\times p_n}$.

4.1. Inference for LASSO with non-Gaussian errors
Consider the linear model

$$y = X\beta^0 + \epsilon, \quad X \in \mathbb{R}^{n\times p}, \quad \epsilon_i \sim G(0, \sigma^2), \tag{17}$$
where $\sigma$ is known and the distribution $G$ has finite third moments, but
is not necessarily Gaussian.

Tibshirani (1996) proposed the now famous LASSO. We get a sparse solution
$\hat\beta$ by solving

$$\hat\beta = \operatorname{argmin}_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \tag{18}$$

where $\lambda > 0$ is the fixed regularization parameter. We choose $\lambda$
as in Negahban et al. (2012). If we normalize the columns of $X$ to have norm
1, Negahban et al. (2012) choose $\lambda$ to be $O(\sqrt{\log p})$.

4.1.1. Affine selection procedure

As in Lee et al. (2013), we solve (18) and get a solution $\hat\beta$. Now we
consider the selection procedure based on $(E, z_E)$, where

$$E = \mathrm{supp}(\hat\beta), \quad z_E = \mathrm{sign}(\hat\beta_E),$$

where $\hat\beta_E$ is $\hat\beta$ restricted to the active set $E$. Note this
is different from the selection procedure based only on $E$ but is closely
related; for a detailed discussion see Lee et al. (2013). The authors of
Lee et al. (2013) proved that such a selection procedure is equivalent to the
affine constraints $A(E, z_E)y \le b(E, z_E)$, where

$$A(E, z_E) = -\mathrm{diag}(z_E)(X_E^TX_E)^{-1}X_E^T,$$

$$b(E, z_E) = -\lambda\,\mathrm{diag}(z_E)(X_E^TX_E)^{-1}z_E.$$

To test the hypothesis $H_{0j}: \beta_{j,E} = 0$ for any $j \in E$, we choose
$\eta$ as in (11).

In this case, a simple calculation puts the number of possible states at
$|\mathcal{S}| = 2^p$, which causes the bound in (14) to blow up when $p > n$.
However, the choice of $\lambda = O(\sqrt{\log p})$ (Negahban et al. 2012)
together with other conditions will ensure $|\mathcal{S}|$ is polynomial in
$p$ with high probability.

4.1.2. Number of states $|\mathcal{S}|$ for $\lambda = O(\sqrt{\log p})$

Suppose $X$ is column standardized to have mean zero and norm 1. We first
introduce the restricted strong convexity condition for the matrix $X$.

Definition 2 (Restricted strong convexity, Negahban et al. (2012)). We say
$X \in \mathbb{R}^{n\times p}$ satisfies the restricted strong convexity
condition for index set $A$ with constant $m > 0$ if

$$\|Xv\|_2^2 \ge m\|v\|_2^2$$

for all $v \in \{\Delta \in \mathbb{R}^p : \|\Delta_{A^c}\|_1 \le 3\|\Delta_A\|_1\}$.

Now we define the assumptions needed to ensure $|\mathcal{S}|$ is polynomial
in $p$ with high probability.
Assumption 2. $X$ satisfies the restricted strong convexity condition for
$A = \mathrm{supp}(\beta^0)$ with constant $m$, and $\phi_{\max}$, the largest
eigenvalue of $X^TX$, is bounded by a constant $Q$.

Assumption 3. The $\epsilon_i$ are sub-Gaussian errors with known variance
$\sigma^2$.

Assumption 4. The signal is sparse. More specifically,
$k = |\mathrm{supp}(\beta^0)|$ is bounded by a constant $K$.

Following Negahban et al. (2012), Lemma 1 shows that, with the above
assumptions, the effective size of $|\mathcal{S}|$ is polynomial in $p$ with
high probability.

Lemma 1. With Assumptions 2-f, if we solve (18) with X > and get 

active set E, then with probability at least 1 — ci exp(—ciA^), 


\E\< 


leg' 




where ci is some constant that depends on m and the subgaussian constant of 
the error e. Thus, with probability 1 — ci exp(—ciA^), 


|5| < p^^, 


leg^ 

rn^ 


The proof of Lemma 1 is deferred to the appendix. Having controlled
$|\mathcal{S}|$, we now need a bound for the influences.


4.1.3. Bounding the influence $M(\mathcal{E}^*, \eta)$


Assume we have normalized the design matrix $X$ columnwise so that each column
has norm 1. We further make the following assumption on $X$.


Assumption 5. Suppose we solve problem (18) with X and get the active set 
E. Let (prnin be the smallest eigenvalue for submatrices of size less than n x \E\, 
more specifically. 


min 

«GRp.||w||o<|B| 


Iklli 


We assume (t)min > > 0. 


Lemma 2. Suppose X satisfies Assumption 5, then 


max 

'iJ 


{{xIXe)-^xI)^ 


\E\ 

< ■ max Aij 

J/2 i,j 


4.1.4. Choice of $\delta_n$ in Assumption 1

Suppose we normalize the columns of $X$ to have norm 1 and choose
$\lambda = O(\sqrt{\log p})$ in (18) as in Negahban et al. (2012). Then we
assume Assumption 1 is satisfied with
$\delta_n = O((\sqrt{\log p_n})^{-1-\kappa})$, for any small $\kappa > 0$.
To keep the discussion short and focused on the main topic, we illustrate that
Assumption 1 is satisfied with such $\delta_n$'s in the following simplified
setup. However, the approach can be adapted to include more general cases.

Lemma 3. Suppose Assumption 2-/ are satisfied. We further assume that ze = 
1 and the matrix {X]^Xe)~^ is equicorrelated, i.e. 


{{XIXe)-^)^^ = 
{{XIXe)-^) 
{{XIXe)-^) 


{{XIXe)-^)^^ = t > 0 , 

- > 0 , 'ii,jeE,i/=j. 


Then if $\|\beta^0\|_\infty = O(\lambda_n)$, Assumption 1 is satisfied with
$\delta_n = (\sqrt{\log p_n})^{-1-\kappa}$ for any $\kappa > 0$.

Remark 1. Note that if we do not assume $z_E = 1$, the last two conditions in
Assumption 1 are still satisfied with
$\delta_n = (\sqrt{\log p_n})^{-1-\kappa}$, and the first two conditions can
be satisfied under further assumptions. But we do not pursue the technical
details here.


4.1.5. Convergence of selective tests in the Lasso problems

Suppose we solve the Lasso problem (18), get the active set $E$, and want to
test the hypotheses $H_{0j}: \beta_{j,E} = 0$. We can simply take $\eta$ as in
(11). We now summarize the above results and apply Theorem 3 to get the
following result.

Lemma 4. Suppose we solve the Lasso problem (18) with A„ = 4a\/logpn, 
and Assumption 1-5 are satisfied and the Sn’s in Assumption 1 is chosen as 
(logp„)“ 5 “ 5 «. If we further assume max\Xij\ = 0(n“^), ||/3°||oo = 0(\/logp„), 
and there exists k > 0 such that 

n“^/^(logp„)^^''"^'^^ —>■ 0, 

then the pivot in (16) calculated with the appropriate
$(\eta_n, L_{\mathcal{E}_n^*}, U_{\mathcal{E}_n^*})$ converges to Unif(0,1).
Furthermore, we can construct a test for $\beta_{j,E}$ based on this pivot
that controls the “Type-I error” (1) asymptotically.


4.2. Covariance test for $\ell_1$-penalized generalized linear models

One of the first results in selective inference was the covariance test
(Lockhart et al. 2013), which provided an asymptotic limiting distribution for
the first step of the Lasso or LARS path. An exact version of this test under
Gaussian errors was described in Taylor et al. (2013).

In the following, we generalize the covariance test to generalized linear
models. Suppose $\mathcal{L}(y|x)$ is in an exponential family. More
specifically,

$$p(y|x; \beta^0) = b(y)\exp\left[(x^T\beta^0)y - A(x^T\beta^0)\right], \tag{20}$$

where $\beta^0$ and $x$ are p-dimensional vectors and $A(\eta)$ is the
cumulant generating function of the distribution.

Suppose $y_i|x_i$ are independently distributed according to the law above,
where the $x_i$'s are considered fixed. Then the $\ell_1$-penalized
generalized linear regression can be expressed as

$$\hat\beta_\lambda = \operatorname{argmin}_{\beta} \sum_{i=1}^n -\log p(y_i|x_i; \beta) + \lambda\|\beta\|_1. \tag{21}$$

The covariance test for the global null $H_0: \beta^0 = 0$ is based upon the
first knot on the solution path of (21), which is the largest score statistic
(in absolute value) at $\beta^0 = 0$,

$$\lambda_1 = \sup\{\lambda : \hat\beta_\lambda \ne 0\} = \|X^T(y - \nabla A(0))\|_\infty. \tag{22}$$

The variable achieving the maximum in (22) is the first variable to enter the
solution path.

The covariance test can also be viewed as a test for the coefficient with
(potentially) the largest absolute value. A guess for such a variable is the
first variable to enter the solution path of (21). In other words, covariance
tests select the target of inference based on $(j^*, s^*)$, where

$$(j^*, s^*) = \left(\operatorname{argmax}_j |x_j^T(y - \nabla A(0))|,\ \mathrm{sign}\left(x_{j^*}^T(y - \nabla A(0))\right)\right), \tag{23}$$

and the test statistic is $\lambda_1 = |x_{j^*}^T(y - \nabla A(0))|$.

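4.2.1. Affine selection procedure

The selection procedure is based on $(j^*, s^*)$ defined in (23). It is easy
to see that it is equivalent to

$$x_k^T(y - \nabla A(0)) \le s^*x_{j^*}^T(y - \nabla A(0)), \quad k = 1,\dots,p,$$

$$-x_k^T(y - \nabla A(0)) \le s^*x_{j^*}^T(y - \nabla A(0)), \quad k = 1,\dots,p.$$

Writing this in the form $A(j^*, s^*)y \le b(j^*, s^*)$, we have

$$
A = \begin{pmatrix}
x_1^T - s^*x_{j^*}^T \\
\vdots \\
x_p^T - s^*x_{j^*}^T \\
-x_1^T - s^*x_{j^*}^T \\
\vdots \\
-x_p^T - s^*x_{j^*}^T
\end{pmatrix},
\qquad
b = \begin{pmatrix}
(x_1^T - s^*x_{j^*}^T)\nabla A(0) \\
\vdots \\
(x_p^T - s^*x_{j^*}^T)\nabla A(0) \\
-(x_1^T + s^*x_{j^*}^T)\nabla A(0) \\
\vdots \\
-(x_p^T + s^*x_{j^*}^T)\nabla A(0)
\end{pmatrix}.
$$

We notice that $\lambda_1 = s^*x_{j^*}^T(y - \nabla A(0))$. Thus to test the
global null $H_0: \beta^0 = 0$, we simply take

$$\eta = s^*x_{j^*}.$$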
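The affine representation above can be assembled mechanically. The following Python sketch (standard library only) builds $(j^*, s^*)$, $\lambda_1$, $A$ and $b$ from a design matrix, a response and the vector $\nabla A(0)$. The function name and the small data set are hypothetical; the example assumes a logistic model, for which each entry of $\nabla A(0)$ equals 1/2.

```python
def covariance_test_constraints(X, y, grad_A0):
    """Return (j_star, s_star, lam1, A, b) with A y <= b encoding (j*, s*).

    X is a list of n rows of length p; y and grad_A0 are lists of length n.
    """
    n, p = len(X), len(X[0])
    resid = [y[i] - grad_A0[i] for i in range(n)]
    cols = [[X[i][k] for i in range(n)] for k in range(p)]
    scores = [sum(c[i] * resid[i] for i in range(n)) for c in cols]

    j_star = max(range(p), key=lambda k: abs(scores[k]))
    s_star = 1.0 if scores[j_star] >= 0 else -1.0
    lam1 = s_star * scores[j_star]  # equals |x_{j*}^T (y - grad A(0))|

    A, b = [], []
    for sign in (1.0, -1.0):        # rows +x_k - s* x_{j*}, then -x_k - s* x_{j*}
        for k in range(p):
            row = [sign * cols[k][i] - s_star * cols[j_star][i] for i in range(n)]
            A.append(row)
            b.append(sum(row[i] * grad_A0[i] for i in range(n)))
    return j_star, s_star, lam1, A, b


# Hypothetical toy data: 4 observations, 3 binary features, binary response.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [1.0, 0.0, 1.0, 1.0]
grad_A0 = [0.5, 0.5, 0.5, 0.5]  # logistic model: A'(0) = 1/2 in every coordinate

j_star, s_star, lam1, A, b = covariance_test_constraints(X, y, grad_A0)
satisfied = all(
    sum(a * v for a, v in zip(row, y)) <= b_k + 1e-12 for row, b_k in zip(A, b)
)
```

By construction the observed $y$ satisfies all $2p$ inequalities, which is what `satisfied` checks; the rows with $k = j^*$ are trivially satisfied.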
The challenge in establishing a result for the covariance test for GLMs is the
lack of Gaussianity in the data distribution. The tools we develop in this
paper, however, can circumvent this. But we first need to establish the
resulting pivot which we can use to test the hypothesis $H_0: \beta^0 = 0$.
Note that $y_i | x_i \sim G(\mu(x_i), \sigma(x_i)^2)$; thus if $G$ were the
normal distribution, we would have an exact pivot by applying Theorem 1. This
result is also given in Taylor et al. (2013). Formally, we have the following
corollary.


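Corollary 1 (Global test for Gaussian errors). Suppose
$\mathcal{Y}_i \sim N(\mu_i, \sigma_i^2)$ independently, with
$X \in \mathbb{R}^{n\times p}$ fixed, and define $\mu = (\mu_1,\dots,\mu_n)$,
$\Sigma = \mathrm{diag}\{\sigma_1^2,\dots,\sigma_n^2\}$. After getting the
first knot on the solution path of (21), we get $(j^*, s^*)$ as defined in
(23) and $\lambda_1 = |x_{j^*}^T(\mathcal{Y} - \mu)|$. Furthermore, we also
define $\Theta_{jk} = x_j^T\Sigma x_k$ and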

_ s{xk — — g) 


sjxk 0j*fc/0j«j*3;j«) {y ft) 
(s,fc):sG{ — , 1 SS Qj*k/^j*j* 

l-ss*0j»fc/e3.j.<O 


Then, 


$ 




- 


$ 


I Uji* ,s*) ] — ih f Tjj ] 


Unif(0,1) 


(24) 


Corollary 1 gives a pivot (24) which we can use to test the global null
$H_0: \beta^0 = 0$ and control the “Type-I error” (1). In practice, we often
normalize the columns of the design matrix $X$. In addition, we may assume the
observations $y_i$ are independently distributed with the same marginal
variance, i.e. $\Sigma = \sigma^2 I$. Then $U_{(j^*,s^*)} = \infty$ and
$L_{(j^*,s^*)}$ simplifies to the second knot $\lambda_2$ in the solution
path, thus we have:

$$\frac{1 - \Phi(\lambda_1)}{1 - \Phi(\lambda_2)} \sim \mathrm{Unif}(0,1). \tag{25}$$

For the pivot (25) to converge to Unif(0,1), we need to consider the number of
states $|\mathcal{S}|$, the bound on the influence $M$, as well as the choice
of $\delta_n$ in Assumption 1.


4.2.2. The conditions for the pivot to converge

Since $j^* \in \{1,\dots,p\}$ and $s^* \in \{-1, 1\}$, the number of possible
states $|\mathcal{S}|$ is naturally bounded by $2p$, and
$r(\mathcal{E}^*) = 2p$. We assume $\Sigma = \sigma^2 I$ and that $X$ is
normalized columnwise to have norm 1. We first introduce the following
condition on the design matrix $X$, which states that no two columns of $X$
can be too correlated.
Assumption 6. Suppose there exists $\rho > 0$ such that

$$x_i^Tx_j \le \rho < 1, \quad i \ne j, \quad i, j \in \{1,2,\dots,p\}.$$

Under Assumption 6 , it is not hard to verify 


1 - P 

Now we need to pick the $\delta_n$'s such that Assumption 1 holds. In
particular, we choose $\delta_n = (\sqrt{\log p_n})^{-1-\kappa}$, for some
$\kappa > 0$. Now if we apply Theorem 3, we have the following result.

Corollary 2. Suppose y\X is generated independently coordinate-wise through 
the distribution in (20) with the same marginal variance. Assume the eolumns 
of X have norm 1, Assumption 6 is satisfied and maxy \Xij \ = 0{n~'^). Then 

if 

{\ogpnf^^'^n~^ -)> 0 , 

the pivot eonverges to Unif(0,1) under the global null Hq : /3° = 0, 


l-$(Ai) 

1-$(A2) 


4 Unif(0,l). 


5. Proof of the theorems 

Without loss of generality, we restrict our interest to the case
$\mu = \mu(X) = 0$, $\Sigma = \Sigma(X) = I$. This is possible since any
affine selection procedure $\mathcal{E}^*$ applied to data with mean
$\mu(X) \ne 0$ is equivalent to a centered affine selection procedure
$\mathcal{E}^{*,0}$ applied to the centered data. Specifically, the linear
part of $\mathcal{E}^{*,0}$ is the same as that of $\mathcal{E}^*$ and the
offsets are related by

$$b^0(\mathcal{E}) = b(\mathcal{E}) - A(\mathcal{E})\mu.$$

Further, note that all quantities in the theorems above are independent of
$b$. Scaling of the errors is handled in a similar fashion.


5.1. Proof of Theorem 1 

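Analogous to the proof in Lee et al. (2013), we prove Theorem 1.

Proof. To lighten the notation, we suppress all dependence on $X$, as it is
assumed known. Note that
$\{\mathcal{E}^*(\mathcal{Y}) = \mathcal{E}\} = \{A(\mathcal{E})\mathcal{Y} \le b(\mathcal{E})\}$.
Thus

$$\mathcal{L}\left(\eta(\mathcal{E})^T\mathcal{Y} \mid \mathcal{E}^*(\mathcal{Y}) = \mathcal{E}\right) = \mathcal{L}\left(\eta(\mathcal{E})^T\mathcal{Y} \mid A(\mathcal{E})\mathcal{Y} \le b(\mathcal{E})\right).$$

Dropping the dependence on $\mathcal{E}$ for the moment,

$$
\begin{aligned}
\{A\mathcal{Y} \le b\} &= \{A\mathcal{Y} - \mathbb{E}[A\mathcal{Y}|\eta^T\mathcal{Y}] \le b - \mathbb{E}[A\mathcal{Y}|\eta^T\mathcal{Y}]\} \\
&= \{\mathbb{E}[A\mathcal{Y}|\eta^T\mathcal{Y}] \le b - (A\mathcal{Y} - \mathbb{E}[A\mathcal{Y}|\eta^T\mathcal{Y}])\} \\
&= \{\alpha\eta^T\mathcal{Y} \le b - A\mathcal{Y} + \alpha\eta^T\mathcal{Y}\} \\
&= \{\alpha_j\eta^T\mathcal{Y} \le b_j - (A\mathcal{Y})_j + \alpha_j\eta^T\mathcal{Y},\ j = 1,\dots,k\}.
\end{aligned}
$$

In other words,
$\{A\mathcal{Y} \le b\} = \{A(\mathcal{E})\mathcal{Y} \le b(\mathcal{E})\} = \{L_{\mathcal{E}}(\mathcal{Y}) \le \eta^T\mathcal{Y} \le U_{\mathcal{E}}(\mathcal{Y})\}$,
and

$$\mathcal{L}\left(\eta(\mathcal{E})^T\mathcal{Y} \mid \mathcal{E}^*(\mathcal{Y}) = \mathcal{E}\right) = \mathcal{L}\left(\eta(\mathcal{E})^T\mathcal{Y} \mid L_{\mathcal{E}}(\mathcal{Y}) \le \eta(\mathcal{E})^T\mathcal{Y} \le U_{\mathcal{E}}(\mathcal{Y})\right).$$

Note also from the derivation above that
$(L_{\mathcal{E}}(\mathcal{Y}), U_{\mathcal{E}}(\mathcal{Y}))$ is independent
of $\eta^T\mathcal{Y}$ for each $\mathcal{E}$. Thus if we condition on
$\mathcal{E}^*$, $U_{\mathcal{E}^*}(\mathcal{Y})$ and
$L_{\mathcal{E}^*}(\mathcal{Y})$, then $\eta(\mathcal{E}^*)^T\mathcal{Y}$ is
distributed as a Gaussian random variable with mean 0 and variance
$\|\eta(\mathcal{E}^*)\|^2$ truncated at $U_{\mathcal{E}^*}$ and
$L_{\mathcal{E}^*}$. Therefore,

$$F\left(\eta^T\mathcal{Y};\ \|\eta\|^2,\ 0,\ L_{\mathcal{E}^*}(\mathcal{Y}),\ U_{\mathcal{E}^*}(\mathcal{Y})\right) \Big|\ \mathcal{E}^* = \mathcal{E},\ L_{\mathcal{E}^*}(\mathcal{Y}),\ U_{\mathcal{E}^*}(\mathcal{Y})\ \sim\ \mathrm{Unif}(0,1).$$

Considering that, conditional on $\mathcal{E}^*$,
$\eta(\mathcal{E}^*)^T\mathcal{Y}$ is independent of $U_{\mathcal{E}^*}$ and
$L_{\mathcal{E}^*}$, we have (7). □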


5.2. Smoothing the maxima of affine functions 

In the proof of Theorem 2 and the related lemmas and corollaries, a technique 
developed by Chatterjee (2005) is frequently used. Roughly speaking, we want 
to study convergence of functions like Ls and Us which can be expressed as 
maxima or minima of affine functions. These non-smooth functions are replaced 
by a smoothed surrogate at the cost of a factor appearing in their derivatives 
depending on the smoothing parameter. 

Specifically, we are interested in how this smoothing affects the following 
quantities. 


Definition 3. For any f G define 



(26) 


For any finite collection of functions define 

Xr{F) = maxAr.(/). 


Definition 4. For any g G C^(T>,R) where C and any multi-index a = 
(oi, 02 , as): we define for r = 1, 2, 3, 
Now we define the smoothed maxima operator.

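Definition 5. Let $f: \mathbb{R}^n \to \mathbb{R}^3$ and define
$f = \max_{v \in \mathcal{F}} v$, where

$$\mathcal{F} = \{z \mapsto v_j(z),\ v_j \in C^3(\mathbb{R}^n, \mathbb{R}^3)\}$$

is a finite collection of thrice differentiable functions $v_j$. The maximum
is taken coordinate-wise.

We define the smoothed maxima operator with parameter $\beta$ as

$$\Gamma(f, \beta) = \frac{1}{\beta}\log\left[\sum_{j} \exp(\beta v_j)\right] \in C^3(\mathbb{R}^n, \mathbb{R}^3), \tag{27}$$

where the operators log and exp are applied coordinate-wise.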

Suppose the range of /, r(/, /?), denoted as Ti{f),TZ{T{f, fi)) C V and let 
h = g o f, hjs = g o r(/, /3), then Lemma 5 gives a bound on \\h — hpWoo and 
Xsihp). 

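Lemma 5. Assume the same notation as above and let $s = |\mathcal{F}|$. Then for $\beta \ge 1$,

$$\|h - h_\beta\|_\infty \le 3 C_1(g)\, \frac{1}{\beta} \log s, \tag{28}$$

$$\lambda_3(h_\beta) \le 13 c\, \beta^2\, C_3(g)\, \lambda_3(\mathcal{F}), \tag{29}$$

where $c$ is a universal constant.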

The proof of Lemma 5 will refer to the following lemma whose proof we leave 
in the Appendix. 

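Lemma 6. For any $f \in C^3(\mathbb{R}^n; \mathbb{R}^3)$ and $g \in C^3(\mathbb{R}^3; \mathbb{R})$, $r = 1, 2, 3$,

$$\lambda_r(g \circ f) \le c\, C_r(g) \cdot \lambda_r(f), \qquad \forall\, l = 1, 2, \dots, n,\ x \in \mathbb{R}^n,$$

where $c$ is a universal constant.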

Now we prove Lemma 5. 

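Proof. Note that for any $u \in \mathbb{R}^s$,

$$\begin{aligned}
\max_{1 \le j \le s} u_j &= \frac{1}{\beta} \log \exp\Big(\beta \max_{1 \le j \le s} u_j\Big) \\
&\le \frac{1}{\beta} \log \sum_{i=1}^{s} \exp(\beta u_i) \\
&\le \frac{1}{\beta} \log\Big[s \exp\Big(\beta \max_{1 \le j \le s} u_j\Big)\Big] \\
&= \frac{1}{\beta} \log s + \max_{1 \le j \le s} u_j.
\end{aligned} \tag{30}$$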
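The sandwich in (30) is easy to check numerically. Below is a minimal sketch, assuming scalar inputs; `smooth_max` is a hypothetical name, and subtracting the largest value before exponentiating is a standard stabilisation that does not change the result.

```python
import math


def smooth_max(values, beta):
    """(1/beta) * log(sum_j exp(beta * u_j)), computed stably."""
    top = max(values)
    total = sum(math.exp(beta * (u - top)) for u in values)
    return top + math.log(total) / beta


u = [0.3, -1.2, 0.9, 0.85]
for beta in (1.0, 10.0, 100.0):
    gap = smooth_max(u, beta) - max(u)
    # (30): the smoothed maximum exceeds the maximum by at most log(s) / beta.
    print(beta, gap, math.log(len(u)) / beta)
```

The gap shrinks as $\beta$ grows, while the derivatives of the smoothed function grow with $\beta$; this is the trade-off that the later choice of the smoothing parameter balances.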
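We take $h = g \circ f$ and $h_\beta = g \circ \Gamma(f, \beta)$. Then

$$\begin{aligned}
|h(z) - h_\beta(z)| &= |g \circ f(z) - g \circ \Gamma(f, \beta)(z)| \\
&\le 3 C_1(g)\, \|f(z) - \Gamma(f, \beta)(z)\|_\infty \\
&\le 3 C_1(g) \cdot \frac{1}{\beta} \log s,
\end{aligned}$$

where the $\infty$-norm is the element-wise maximum absolute value. Thus we have proved (28). Now let $f = (f_1, f_2, f_3)$ and $v_j = (v_{1j}, v_{2j}, v_{3j})$, and define

$$\mathcal{F}_i = \{z \mapsto v_{ij}(z),\ v_{ij} \in C^3(\mathbb{R}^n, \mathbb{R})\}, \quad i = 1, 2, 3.$$

Theorem 1.3 in Chatterjee (2005) proved that

$$\lambda_3(\Gamma(f_i, \beta)) \le 13 \beta^2 \lambda_3(\mathcal{F}_i), \quad \forall \beta \ge 1,\ i = 1, 2, 3. \tag{31}$$

Note that $\lambda_3(\Gamma(f, \beta)) = \max_{i=1,2,3} \lambda_3(\Gamma(f_i, \beta))$ and that $\lambda_3(\mathcal{F}) = \max_{i=1,2,3} \lambda_3(\mathcal{F}_i)$, thus

$$\lambda_3(\Gamma(f, \beta)) \le 13 \beta^2 \lambda_3(\mathcal{F}). \tag{32}$$

This combined with Lemma 6 proves (29). □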

5.3. Proof of Theorem 2 

To prove Theorem 2, we first prove the following lemma. Recall our reduction to the standard Gaussian $N(0, 1)$ in the beginning of Section 5.1. Lemma 7 is a simple adaptation of Lindeberg's proof of the CLT.

Lemma 7. Assuming the same notation as in Theorem 2, for any smooth function $h \in C^3(\mathbb{R}^n)$,

$$|\mathbb{E}h(y) - \mathbb{E}h(\mathcal{Y})| \le \frac{1}{6} \lambda_3(h)\, n \max\big(\gamma, \mathbb{E}(|\mathcal{Y}_i|^3)\big). \tag{33}$$

We will prove Lemma 7 now.

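Proof. The proof proceeds by following the Lindeberg proof of the CLT for $h$. Define

$$\begin{aligned}
y^i &= (y_1, y_2, \dots, y_i, \mathcal{Y}_{i+1}, \dots, \mathcal{Y}_n), \\
y^{i-1} &= (y_1, y_2, \dots, y_{i-1}, \mathcal{Y}_i, \dots, \mathcal{Y}_n), \\
W^i &= (y_1, y_2, \dots, y_{i-1}, 0, \mathcal{Y}_{i+1}, \dots, \mathcal{Y}_n).
\end{aligned}$$

We can break the absolute difference of the two expectations into $n$ parts,

$$|\mathbb{E}h(y) - \mathbb{E}h(\mathcal{Y})| \le \sum_{i=1}^{n} |\mathbb{E}h(y^i) - \mathbb{E}h(y^{i-1})|. \tag{34}$$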
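Note that

$$\begin{aligned}
h(y^i) - h(W^i) &= \partial_i h(W^i)\, y_i + \tfrac{1}{2} \partial_i^2 h(W^i)\, y_i^2 + R^i, \\
h(y^{i-1}) - h(W^i) &= \partial_i h(W^i)\, \mathcal{Y}_i + \tfrac{1}{2} \partial_i^2 h(W^i)\, \mathcal{Y}_i^2 + T^i,
\end{aligned}$$

where $|R^i| \le \tfrac{1}{6} \|\partial_i^3 h\|_\infty |y_i|^3$ and $|T^i| \le \tfrac{1}{6} \|\partial_i^3 h\|_\infty |\mathcal{Y}_i|^3$. Moreover, because the $y_i$'s and $\mathcal{Y}_i$'s are independent, $W^i$ is independent of both $y_i$ and $\mathcal{Y}_i$. Continuing, we see that the first and second order differences cancel out, because $y_i$ and $\mathcal{Y}_i$ have the same first and second moments:

$$\begin{aligned}
|\mathbb{E}h(y^i) - \mathbb{E}h(y^{i-1})| &\le |\mathbb{E}\,\partial_i h(W^i)(y_i - \mathcal{Y}_i)| + \big|\tfrac{1}{2} \mathbb{E}\,\partial_i^2 h(W^i)(y_i^2 - \mathcal{Y}_i^2)\big| + \mathbb{E}|R^i - T^i| \\
&= |\mathbb{E}\,\partial_i h(W^i)\, \mathbb{E}(y_i - \mathcal{Y}_i)| + \big|\tfrac{1}{2} \mathbb{E}\,\partial_i^2 h(W^i)\, \mathbb{E}(y_i^2 - \mathcal{Y}_i^2)\big| + \mathbb{E}|R^i - T^i| \\
&= \mathbb{E}|R^i - T^i|.
\end{aligned}$$

Combining the $n$ parts, we have

$$\begin{aligned}
|\mathbb{E}h(y) - \mathbb{E}h(\mathcal{Y})| &\le \frac{1}{6} \lambda_3(h) \sum_{i=1}^{n} \big[\mathbb{E}(|\mathcal{Y}_i|^3) + \mathbb{E}(|y_i|^3)\big] \\
&\le \frac{1}{6} \lambda_3(h)\, n \max\big(\gamma, \mathbb{E}(|\mathcal{Y}_i|^3)\big).
\end{aligned}$$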


□ 
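The swapping argument can be checked on a case where both expectations are available in closed form. The sketch below is an illustration under stated assumptions, not part of the paper's argument: the test function is $h(y) = \cos(a^T y)$, the non-Gaussian coordinates are independent Rademacher signs (mean 0, variance 1), and the bound used is the intermediate one from the proof, $\frac{1}{6}\sum_i \|\partial_i^3 h\|_\infty\,[\mathbb{E}|y_i|^3 + \mathbb{E}|\mathcal{Y}_i|^3]$. The function name is hypothetical.

```python
import math


def lindeberg_gap_and_bound(a):
    """Compare E h under Rademacher and N(0, 1) coordinates for h(y) = cos(a . y)."""
    # Exact expectation under independent Rademacher (+1/-1) coordinates.
    e_rademacher = math.prod(math.cos(ai) for ai in a)
    # Exact expectation under independent N(0, 1) coordinates.
    e_gaussian = math.exp(-0.5 * sum(ai * ai for ai in a))
    # Third absolute moments: 1 for Rademacher, 2 * sqrt(2 / pi) for N(0, 1).
    m3_rademacher = 1.0
    m3_gaussian = 2.0 * math.sqrt(2.0 / math.pi)
    # For this h, the sup norm of the third partial derivative in y_i is |a_i|^3.
    bound = sum(abs(ai) ** 3 for ai in a) * (m3_rademacher + m3_gaussian) / 6.0
    return abs(e_rademacher - e_gaussian), bound


n = 10
gap, bound = lindeberg_gap_and_bound([1.0 / math.sqrt(n)] * n)
print(gap, bound)
```

With equal weights $a_i = n^{-1/2}$ the bound scales like $n^{-1/2}$, which is the usual rate obtained from a third-moment Lindeberg argument.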


Now we turn to the proof of Theorem 2. 

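Proof. Note that the events $\{A(\mathcal{E}_i, X)y \le b(\mathcal{E}_i)\}$, $1 \le i \le |\mathcal{S}|$, are disjoint, and for any state $\mathcal{E}$,

$$\{A(\mathcal{E}, X)y \le b(\mathcal{E})\} = \{L_{\mathcal{E}}(y) \le \eta(\mathcal{E})^T y \le U_{\mathcal{E}}(y)\}.$$

Therefore the quantity of interest is

$$W\big(\eta(\mathcal{E}^*)^T y, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y)\big) = \max_{\mathcal{E}_i} W\big(\eta(\mathcal{E}_i)^T y, L_{\mathcal{E}_i}(y), U_{\mathcal{E}_i}(y)\big).$$

If we knew the above quantity was smooth with respect to the data, we could apply Lemma 7 directly. However, there are two non-smooth expressions above: the maximum over the states, and the functions $L_{\mathcal{E}_i}$ and $U_{\mathcal{E}_i}$. We smooth each and optimize over the smoothing parameter.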
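We first smooth over $U_{\mathcal{E}}$ and $L_{\mathcal{E}}$. Note that $\forall z \in \mathbb{R}^n$,

$$L_{\mathcal{E}}(z) = \max_{\alpha_j < 0} \frac{b(\mathcal{E})_j - (A(\mathcal{E})z)_j + \alpha_j \eta^T z}{\alpha_j} = \max\{v(z) \mid v \in \mathcal{F}_L\},$$

$$U_{\mathcal{E}}(z) = \min_{\alpha_j > 0} \frac{b(\mathcal{E})_j - (A(\mathcal{E})z)_j + \alpha_j \eta^T z}{\alpha_j} = \min\{v(z) \mid v \in \mathcal{F}_U\},$$

where $\mathcal{F}_L, \mathcal{F}_U$ are the collections of affine functions

$$\mathcal{F}_L = \Big\{v_j : z \mapsto \frac{b(\mathcal{E})_j - (A(\mathcal{E})z)_j + \alpha_j \eta^T z}{\alpha_j},\ \alpha_j < 0\Big\},$$

$$\mathcal{F}_U = \Big\{v_j : z \mapsto \frac{b(\mathcal{E})_j - (A(\mathcal{E})z)_j + \alpha_j \eta^T z}{\alpha_j},\ \alpha_j > 0\Big\}.$$

Finally, note that

$$\max\big(\lambda_3(\mathcal{F}_L), \lambda_3(\mathcal{F}_U)\big) \le M(\mathcal{E}, \eta)^3. \tag{35}$$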

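We define the smoothing parameter $\beta = 1/\delta$, for some $0 < \delta < 1$. Then

$$L_{\mathcal{E},\delta}(z) = \Gamma(L_{\mathcal{E}}(z), 1/\delta) = \delta \log\Big[\sum_{v_j \in \mathcal{F}_L} \exp\Big(\frac{v_j(z)}{\delta}\Big)\Big],$$

$$U_{\mathcal{E},\delta}(z) = -\Gamma(-U_{\mathcal{E}}(z), 1/\delta) = -\delta \log\Big[\sum_{v_j \in \mathcal{F}_U} \exp\Big(-\frac{v_j(z)}{\delta}\Big)\Big].$$

By Lemma 5, for any selection state $\mathcal{E}$ and $z \in \mathbb{R}^n$, we see

$$|L_{\mathcal{E}}(z) - L_{\mathcal{E},\delta}(z)| \le \delta \log r(\mathcal{E}^*), \tag{36}$$

$$|U_{\mathcal{E}}(z) - U_{\mathcal{E},\delta}(z)| \le \delta \log r(\mathcal{E}^*). \tag{37}$$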
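As an illustration of (36) and (37), the sketch below computes the truncation limits of $\eta^T z$ for a small polyhedron $\{Az \le b\}$ together with their smoothed versions. It assumes $\alpha = A\eta / \|\eta\|^2$, the form used by Lee et al. (2013) under identity covariance; this definition is not restated in this section, so treat it as an assumption. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def affine_values(A, b, eta, z):
    """Values v_j(z) of the affine functions, split by the sign of alpha_j."""
    eta_norm2 = dot(eta, eta)
    eta_z = dot(eta, z)
    lower, upper = [], []
    for row, bj in zip(A, b):
        alpha_j = dot(row, eta) / eta_norm2  # assumed form of alpha
        if alpha_j == 0.0:
            continue  # this row does not constrain eta^T z
        v = (bj - dot(row, z) + alpha_j * eta_z) / alpha_j
        (lower if alpha_j < 0 else upper).append(v)
    return lower, upper


def smooth_max(values, delta):
    """delta * log(sum_j exp(v_j / delta)), computed stably."""
    top = max(values)
    return top + delta * math.log(sum(math.exp((v - top) / delta) for v in values))


def bounds(A, b, eta, z, delta):
    """Return (L, U, L_delta, U_delta); infinite limits when a side is unconstrained."""
    lower, upper = affine_values(A, b, eta, z)
    L = max(lower) if lower else -math.inf
    U = min(upper) if upper else math.inf
    L_delta = smooth_max(lower, delta) if lower else -math.inf
    U_delta = -smooth_max([-v for v in upper], delta) if upper else math.inf
    return L, U, L_delta, U_delta


# Box -1 <= z_1, z_2 <= 2 written as A z <= b, with eta = (1, 1).
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [2.0, 1.0, 2.0, 1.0]
print(bounds(A, b, [1.0, 1.0], [0.5, 0.0], 0.1))
```

In this example the smoothed interval $[L_{\mathcal{E},\delta}, U_{\mathcal{E},\delta}]$ sits inside $[L_{\mathcal{E}}, U_{\mathcal{E}}]$ and each endpoint moves by at most $\delta$ times the log of the number of constraints on that side.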


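Based on (37) and Lemma 5, we have

$$\begin{aligned}
&\big|\mathbb{E}W\big(\eta(\mathcal{E}^*)^T y, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y)\big) - \mathbb{E}W\big(\eta(\mathcal{E}^*)^T \mathcal{Y}, L_{\mathcal{E}^*}(\mathcal{Y}), U_{\mathcal{E}^*}(\mathcal{Y})\big)\big| \\
&\quad\le \big|\mathbb{E}W\big(\eta(\mathcal{E}^*)^T y, L_{\mathcal{E}^*,\delta}(y), U_{\mathcal{E}^*,\delta}(y)\big) - \mathbb{E}W\big(\eta(\mathcal{E}^*)^T \mathcal{Y}, L_{\mathcal{E}^*,\delta}(\mathcal{Y}), U_{\mathcal{E}^*,\delta}(\mathcal{Y})\big)\big| \\
&\qquad + 12\delta\, C_1(W) \log r(\mathcal{E}^*) \\
&\quad= \big|\mathbb{E}\max_{\mathcal{E}_i} W\big(\eta(\mathcal{E}_i)^T y, L_{\mathcal{E}_i,\delta}(y), U_{\mathcal{E}_i,\delta}(y)\big) - \mathbb{E}\max_{\mathcal{E}_i} W\big(\eta(\mathcal{E}_i)^T \mathcal{Y}, L_{\mathcal{E}_i,\delta}(\mathcal{Y}), U_{\mathcal{E}_i,\delta}(\mathcal{Y})\big)\big| \\
&\qquad + 12\delta\, C_1(W) \log r(\mathcal{E}^*).
\end{aligned} \tag{38}$$

The equality follows from the fact that $U_{\mathcal{E}} \ge U_{\mathcal{E},\delta}$ and $L_{\mathcal{E}} \le L_{\mathcal{E},\delta}$ for any state $\mathcal{E}$, and that $W$ is supported on $D = \{(u, v, w) \mid v \le u \le w\}$. Therefore, $W\big(\eta(\mathcal{E}^*)^T y, L_{\mathcal{E}^*,\delta}(y), U_{\mathcal{E}^*,\delta}(y)\big) = \max_{\mathcal{E}_i} W\big(\eta(\mathcal{E}_i)^T y, L_{\mathcal{E}_i,\delta}(y), U_{\mathcal{E}_i,\delta}(y)\big)$.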
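Next, we smooth the maximum over states. Define

$$W_{\mathcal{E}}(z) = W\big(\eta^T z, L_{\mathcal{E},\delta}(z), U_{\mathcal{E},\delta}(z)\big), \quad \mathcal{E} \in \mathcal{S},\ z \in \mathbb{R}^n.$$

We also define the smoothed maximum for $\max_{1 \le i \le |\mathcal{S}|} W_{\mathcal{E}_i}(z)$,

$$H_\delta(z) = \Gamma\Big(\max_{1 \le i \le |\mathcal{S}|} W_{\mathcal{E}_i}(z), 1/\delta\Big) = \delta \log\Big[\sum_{i=1}^{|\mathcal{S}|} \exp\Big(\frac{W_{\mathcal{E}_i}(z)}{\delta}\Big)\Big].$$

Thus by Lemma 5,

$$\begin{aligned}
&\big|\mathbb{E}W\big(\eta(\mathcal{E}^*)^T y, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y)\big) - \mathbb{E}W\big(\eta(\mathcal{E}^*)^T \mathcal{Y}, L_{\mathcal{E}^*}(\mathcal{Y}), U_{\mathcal{E}^*}(\mathcal{Y})\big)\big| \\
&\quad\le 12\delta \log|\mathcal{S}| + 12\delta\, C_1(W) \log r(\mathcal{E}^*) + |\mathbb{E}H_\delta(y) - \mathbb{E}H_\delta(\mathcal{Y})|.
\end{aligned} \tag{39}$$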


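For any state $\mathcal{E}$, define $f_{\delta,\mathcal{E}} : z \mapsto \big(\eta(\mathcal{E})^T z, L_{\mathcal{E},\delta}(z), U_{\mathcal{E},\delta}(z)\big)$. Theorem 1.3 in Chatterjee (2005) states that

$$\lambda_3(H_\delta) \le \frac{13}{\delta^2} \max_i \lambda_3(W \circ f_{\delta,\mathcal{E}_i}). \tag{40}$$

Per Lemma 5 and inequality (35),

$$\lambda_3(W \circ f_{\delta,\mathcal{E}}) \le \frac{13c}{\delta^2} C_3(W)\, M(\mathcal{E}^*, \eta)^3. \tag{41}$$

From (33) in Lemma 7 together with (40) and (41), we have

$$|\mathbb{E}H_\delta(y) - \mathbb{E}H_\delta(\mathcal{Y})| \le \frac{169c}{6\delta^4} C_3(W)\, n\, M(\mathcal{E}^*, \eta)^3 \max\big(\gamma, \mathbb{E}[|\mathcal{Y}_i|^3]\big). \tag{42}$$

All the above combined, we have

$$\begin{aligned}
&\big|\mathbb{E}W\big(\eta(\mathcal{E}^*)^T y, L_{\mathcal{E}^*}(y), U_{\mathcal{E}^*}(y)\big) - \mathbb{E}W\big(\eta(\mathcal{E}^*)^T \mathcal{Y}, L_{\mathcal{E}^*}(\mathcal{Y}), U_{\mathcal{E}^*}(\mathcal{Y})\big)\big| \\
&\quad\le 12\delta\big[\log|\mathcal{S}| + C_1(W) \log r(\mathcal{E}^*)\big] + \frac{169c}{6\delta^4} C_3(W)\, n\, M(\mathcal{E}^*, \eta)^3 \max\big(\gamma, \mathbb{E}[|\mathcal{Y}_i|^3]\big).
\end{aligned}$$

Notice that for the last inequality to hold, we require $\delta < 1$. But the optimal $\delta^5 = O\big(n M(\mathcal{E}^*, \eta)^3 / [\log|\mathcal{S}| + \log r(\mathcal{E}^*)]\big)$, which will go to 0 since the numerator shrinks to 0 while the denominator goes to $\infty$ as $n, p \to \infty$. Therefore, the inequality holds.

Optimizing over $\delta$ yields (14). □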
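The final optimization is over a bound of the form $a\delta + b/\delta^4$, where $a$ collects the logarithmic terms and $b$ the third-derivative term. The sketch below, with hypothetical names and arbitrary illustrative values for `a` and `b`, shows the closed-form minimizer $\delta^5 = 4b/a$, which matches the order of the optimal $\delta^5$ stated above.

```python
def objective(delta, a, b):
    """Generic form of the bound: a * delta + b / delta**4, with a, b > 0."""
    return a * delta + b / delta ** 4


def optimal_delta(a, b):
    """Minimizer over delta > 0, from setting the derivative a - 4b/delta^5 to zero."""
    return (4.0 * b / a) ** 0.2


# Illustrative values only: a plays the role of the log terms, b of the n * M^3 term.
a, b = 12.0, 1e-6
d_star = optimal_delta(a, b)
print(d_star, objective(d_star, a, b))
```

At the minimizer the two terms are in ratio 4 to 1, so the optimized bound equals $\tfrac{5}{4}\, a\, \delta^*$ and is of order $a^{4/5} b^{1/5}$.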


5.4. Proof of Theorem 3

Now we turn to the proof of our main result, Theorem 3.

Proof. For convenience of notation, we denote $P(x; \sigma^2, m, a, b)$ by $P(x; a, b)$, omitting $\sigma^2, m$ in the following proof. Define

$$D(\delta) = \{(x, a, b) : a \le x \le b,\ b - a \ge \delta,\ \min(|b|, |a|) \le 1/\delta\}.$$
We claim that for any small $\frac{1}{4} > \delta > 0$, we can find a thrice differentiable function $P_\delta$ such that

$P_\delta$ is supported on the set $\{a \le x \le b\}$,

$\|P_\delta - P\|_\infty \le (K_1 + 1)\delta$ on the set $D(\delta)$,


CsiPs) < K 3 • ^ on the set D{S), 
0 ° 


where $K_1$ and $K_3$ are defined in Corollary 3.

The proof of the existence of such a function $P_\delta$ is left to Lemma 9 in the Appendix. Then for any positive, bounded function $\Psi \in C^3(\mathbb{R}; \mathbb{R}^+)$, $\Psi(0) = 0$, with bounded third derivatives, we have:


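$$\begin{aligned}
&|\mathbb{E}\,\Psi \circ P_\delta(y; L_{\mathcal{E}^*}, U_{\mathcal{E}^*}) - \mathbb{E}\,\Psi \circ P(y; L_{\mathcal{E}^*}, U_{\mathcal{E}^*})| \\
&\quad\le \|\Psi'\|_\infty (K_1 + 1)\delta + 2\|\Psi\|_\infty \cdot P\big(U_{\mathcal{E}^*}(y) - L_{\mathcal{E}^*}(y) < \delta\big) \\
&\qquad + 2\|\Psi\|_\infty \cdot P\big(\min(|U_{\mathcal{E}^*}(y)|, |L_{\mathcal{E}^*}(y)|) > 1/\delta\big), \\
&|\mathbb{E}\,\Psi \circ P_\delta(\mathcal{Y}; L_{\mathcal{E}^*}, U_{\mathcal{E}^*}) - \mathbb{E}\,\Psi \circ P(\mathcal{Y}; L_{\mathcal{E}^*}, U_{\mathcal{E}^*})| \\
&\quad\le \|\Psi'\|_\infty (K_1 + 1)\delta + 2\|\Psi\|_\infty \cdot P\big(U_{\mathcal{E}^*}(\mathcal{Y}) - L_{\mathcal{E}^*}(\mathcal{Y}) < \delta\big) \\
&\qquad + 2\|\Psi\|_\infty \cdot P\big(\min(|U_{\mathcal{E}^*}(\mathcal{Y})|, |L_{\mathcal{E}^*}(\mathcal{Y})|) > 1/\delta\big).
\end{aligned} \tag{43}$$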


On the other hand, we plug in $\Psi \circ P_\delta$ as the $W$ in Theorem 2; then for any sequence of $P_{\delta_n}$,


\E^oPg^[y)-E^oPg^{y)\<C 


K^--{\ogr{£*)\S\fnM{S\yf 

^rj. 


(44) 


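If we choose a subsequence $\delta_n \to 0$ such that the right hand side of (44) goes to zero, then

$$\begin{aligned}
|\mathbb{E}\,\Psi \circ P(y) - \mathbb{E}\,\Psi \circ P(\mathcal{Y})| &\le |\mathbb{E}\,\Psi \circ P_{\delta_n}(y) - \mathbb{E}\,\Psi \circ P_{\delta_n}(\mathcal{Y})| + 2\|\Psi'\|_\infty (K_1 + 1)\delta_n \\
&\quad + 2\|\Psi\|_\infty P\big(U_{\mathcal{E}^*}(y) - L_{\mathcal{E}^*}(y) < \delta_n\big) \\
&\quad + 2\|\Psi\|_\infty P\big(\min(|U_{\mathcal{E}^*}(y)|, |L_{\mathcal{E}^*}(y)|) > 1/\delta_n\big) \\
&\quad + 2\|\Psi\|_\infty P\big(U_{\mathcal{E}^*}(\mathcal{Y}) - L_{\mathcal{E}^*}(\mathcal{Y}) < \delta_n\big) \\
&\quad + 2\|\Psi\|_\infty P\big(\min(|U_{\mathcal{E}^*}(\mathcal{Y})|, |L_{\mathcal{E}^*}(\mathcal{Y})|) > 1/\delta_n\big) \\
&\to 0.
\end{aligned} \tag{45}$$

Note that the right hand side of (45) goes to 0 because of Assumption 1. Thus we have the conclusion that $P(y) \overset{d}{\to} P(\mathcal{Y}) \sim \mathrm{Unif}(0, 1)$.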

□ 


6. Discussion 

This work provides a generic framework in which asymptotic results hold for many selective inference problems. It is, however, not directly applicable to some other procedures. Further work may include:

1. Fixed $\lambda$ for generalized linear models.

Our work derives a theory for inference after an affine selection procedure. However, inference at a fixed $\lambda$ for generalized linear regression is not an affine selection procedure. A plausible solution would be to approximate the loss function of the GLM by a quadratic form and bound the difference between the quadratic form and the GLM loss function. However, this is still an open question.

2. Applying the result to nonparametric problems.

Removing the Gaussian assumption required by Lee et al. (2013), which restricts attention to Gaussian families, is a big step. Without the Gaussian constraint, we can consider some exponential families and potentially some nonparametric problems as well.
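Appendix A: Proof of Lemma 6

Proof. Using the chain rule, we have the second derivatives with respect to $x_l$, $l = 1, 2, \dots, n$, as

$$\frac{\partial^2 (g \circ f)}{\partial x_l^2} = \sum_{j,k} \frac{\partial^2 g}{\partial f_j\, \partial f_k} \frac{\partial f_j}{\partial x_l} \frac{\partial f_k}{\partial x_l} + \sum_{k} \frac{\partial g}{\partial f_k} \frac{\partial^2 f_k}{\partial x_l^2}, \tag{46}$$

and the third derivatives with respect to $x_l$, $l = 1, 2, \dots, n$, as

$$\frac{\partial^3 (g \circ f)}{\partial x_l^3} = \sum_{i,j,k} \frac{\partial^3 g}{\partial f_i\, \partial f_j\, \partial f_k} \frac{\partial f_i}{\partial x_l} \frac{\partial f_j}{\partial x_l} \frac{\partial f_k}{\partial x_l} + 3 \sum_{j,k} \frac{\partial^2 g}{\partial f_j\, \partial f_k} \frac{\partial^2 f_j}{\partial x_l^2} \frac{\partial f_k}{\partial x_l} + \sum_{k} \frac{\partial g}{\partial f_k} \frac{\partial^3 f_k}{\partial x_l^3}. \tag{47}$$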



For $r = 1$, the conclusion is obviously true with the constant $c = 3$. For $r = 2$, the terms involving the partial derivatives of $f$ are

$$\frac{\partial f_j}{\partial x_l} \frac{\partial f_k}{\partial x_l} \quad\text{and}\quad \frac{\partial^2 f_k}{\partial x_l^2}.$$

Note that terms of the first type are bounded by $\lambda_2(f)^{1/2} \cdot \lambda_2(f)^{1/2}$ and terms of the second type are bounded by $\lambda_2(f)$. If we take $c = 12$, then $|\partial_l^2\, g \circ f| \le c\, C_2(g) \lambda_2(f)$. On the other hand,


\digof\^ < 


-A dg df^ 

h 


< [302(3)2 A2(/)2]^ < 902 ( 3 )A 2 (/). 


(48) 


For r = 3, an equation similar to (48) will give us \dig o /p < 27C3{g)X3{f). 
Meanwhile, 


\dfgof\ 


3 

2 = 



< 


903(3) ^ (A3(/) • A3(/) 3 ) + 303(3) ^ ^ ) 


= 12i03(3)A3(/). 


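For the third derivatives $\partial_l^3\, g \circ f$, the terms that involve $f$ are

$$\frac{\partial f_i}{\partial x_l} \frac{\partial f_j}{\partial x_l} \frac{\partial f_k}{\partial x_l}, \qquad \frac{\partial^2 f_j}{\partial x_l^2} \frac{\partial f_k}{\partial x_l}, \qquad \frac{\partial^3 f_k}{\partial x_l^3},$$

which are all bounded by $\lambda_3(f)$, and therefore $\lambda_3(g \circ f) \le 57\, C_3(g) \lambda_3(f)$. In summary, we can take $c = 57$. □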


Appendix B: Existence of smooth approximation P 


We prove the existence of such functions as claimed in the proof of Theorem 3. Define $P(x, a, b) = P(x; \sigma^2, m, a, b)$. We first prove the following lemma.

Lemma 8. Define

$$D(\delta) = \Big\{(x, a, b) : a \le x \le b,\ b - a \ge \min\Big(\delta, \frac{1}{\min(|b|, |a|)}\Big)\Big\}.$$

Then, on $D(\delta)$ for any $\delta < 1/4$ we have

$$\max\Big(\frac{e^{-x^2/2}}{\Phi(b) - \Phi(a)},\ \frac{e^{-a^2/2}}{\Phi(b) - \Phi(a)},\ \frac{e^{-b^2/2}}{\Phi(b) - \Phi(a)}\Big) \le C \max\big(\delta^{-1}, \min(|a|, |b|)\big)$$

for some universal constant $C$.

Proof. Note that for any $\delta > 0$, on

$$D(\delta) \cap \{(x, a, b) : \mathrm{sign}(a) = \mathrm{sign}(b)\}$$

we have

$$e^{-x^2/2} \le \max\big(e^{-a^2/2}, e^{-b^2/2}\big),$$

so it suffices to prove that


-a‘^/2 




max 


$(&) - $(a)’ $(6) - $(a) 
on D(S) for <5 < 1/4 as well as 

e-xV2 


< Cb 


-1 


sup 

(x,a,b)e-D((5)n{(ai,o,6):sign(a)^sign(6)} 


< C(5-\ 


Let’s consider this latter case first. For any 5 > 0 on the set Z)(^)n{(a;, a, 6) : sign(a) 7 ^ sign(6)} 
we have 




< 


1 


^{b) - 4>(a) “ ^{b) - $(a) 
< 


< 


ffif(o, 6 ):sign(a) 7 ^sign( 6 ), 6 —a>(5 


be 


Continuing, we further split the first case into two cases, i.e. Zl(^)n{(x, a, b) : sign(a) 7^ sign(6)} 
and D{b) fl {(x, a, 6) : sign(a) = sign(6)}. For the first part, analogous to the 
analysis above, we have 


,-oV2 


,-f>V2 


< 


1 


$(6) - $(a) ’ $(6) - $(a)y 4>(6) - $(a) 

< 


< 


^^^(a, 6 ):sign(a):^sign(&),&—a>(5 


be 


Now, we reduce to the case D{b) fl {(x, a, b) : sign(a) = sign(6)} and without 
loss of generality, we consider the case where 0 < a < b. Note that for any (5 > 0 
on D{b) n {(x, a, b) : sign(a) = sign(6)} fl {0 < a < 6} 


,-aV2 


„-&V2 


,-aV2 


max 


max 


< sup 


$(6) - <l>(a)’ m - $(a) / - J/, $(a + <5) - $(a) 


,-aV2 


= -bV2 


< sup 


,- a ^/2 


m - 4>(a) ’m - $(«) J - $(a + 1/a) - $(«) ’ 

For ^ < 1/4 and 0 < a < 1/^, 

(27r)^/2($(a + <5) - 4>(a)) > 
Or,


,-aV2 


$(a + (5) - $(a) 


< CiS-\ 


where Ci = 

Similarly, for 0 < 1/(5 < a. 


(27r)i/2($(a + 1/a) - $(a)) > 


Or, 


,-aV2 


= _e-oV2g-l-l/(2aO 


< a(2^)i/2ei+V(2aO 


$(a + 1/a) — $(a) 

< a(2^)i/2ei+V(2aO 

< a(2^)i/2ei+^^' 

< Cia 

where Ci = Therefore, we have on £>((5)n{(a:, a, b) : sign(a) = sign(6)}n 

{0 < a < 6}, 


= -a"/2 


,-f>V2 


max 


$(6) - $(a) ’ $(6) - $(a) 


< max ( -, min(|a|, |6|) 


□ 


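Remark 2. If we take

$$D(\delta) = \{(x, a, b) : a \le x \le b,\ b - a \ge \delta,\ \min(|a|, |b|) \le 1/\delta\},$$

then

$$\max\Big(\frac{e^{-x^2/2}}{\Phi(b) - \Phi(a)},\ \frac{e^{-a^2/2}}{\Phi(b) - \Phi(a)},\ \frac{e^{-b^2/2}}{\Phi(b) - \Phi(a)}\Big) \le C \delta^{-1},$$

where $C$ is a universal constant.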

Corollary 3 . For 5 < 1/4 and any multi-index a = (01,02,0:3) we have 

sup Ci^iiP) < K\a\6~^°^. 

{x,a,b)^D{S) 

for constants Ki^l > 1. 

Proof. We prove the case $\alpha = (0, 0, 1)$; similar proofs extend to other multi-indices $\alpha$ as well. Since $P(x, a, b) = 2\min(F(x; a, b), 1 - F(x; a, b))$, we only need to prove the bound for $F(x; a, b)$.


$(x) — $(a) 1 


OF 


db [$(6) - d>(a)]2 


exp(-6^/2). 
Therefore,


dP 

1 

$(x) — $(a) 

db 


$(6) - $(a) 


exp(-bV2) ^ 1 

$(&)-$(a)- S' 


□ 


Finally, we put the lemma and the corollary together and prove the following lemma.

Lemma 9. There exists a thrice differentiable approximation $P_\delta$ to $P$ that satisfies:

• $P_\delta(x, a, b)$ is supported on $\{(x, a, b) : a \le x \le b\}$,

• C^{P) < on the set D(S), 

• $\|P_\delta - P\|_\infty \le (K_1 + 1)\delta$ on the set $D(\delta)$.


Proof. Let $\bar{P}_\delta$ be the smoothed version of $P$ for the minimum function in $P = 2\min(F, 1 - F)$, with $\|\bar{P}_\delta - P\|_\infty \le \delta$. Let $P_\delta = \bar{P}_\delta\, I_{\delta^2}(x, a, b)$, where $I_{\delta^2}(x, a, b)$ is a smoothed version of the indicator function of $\{a \le x \le b\}$ that satisfies

$$I_{\delta^2}(x, a, b) \begin{cases} = 0, & x < a \text{ or } x > b, \\ = 1, & a + \delta^2 \le x \le b - \delta^2, \\ \in [0, 1], & \text{else.} \end{cases}$$


Is 2 {x,a,b) also satisfies C^ils^) < for some universal constant C. Thus it is 
not hard to verify that Cz{Ps) < Ks-^- 


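$$\begin{aligned}
\|P_\delta - P\|_\infty &\le \|P - P I_{\delta^2}\|_\infty + \|\bar{P}_\delta I_{\delta^2} - P I_{\delta^2}\|_\infty \\
&\le \sup_{a < x < a + \delta^2 \text{ or } b - \delta^2 < x < b} \|P\|_\infty + \delta \\
&\le C_1(P) \cdot \delta^2 + \delta \\
&\le (K_1 + 1)\delta.
\end{aligned}$$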


□ 


Appendix C: LASSO related proofs 
C. 1. Proof of Lemma 1 

We first introduce the following Lemma in Negahban et al. (2012). 
Lemma 10. Under the same assumptions and notation as in Lemma 1, with probability at least $1 - c_1 \exp(-c_1 \lambda^2)$, the following two inequalities hold:


A > 2||Ji:£||„, 


m 


(49) 

(50) 


where $k$ is the number of nonzero entries in $\beta^0$ and $\hat\beta$ is the solution to (18).


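Proof of Lemma 1. According to the KKT conditions,

$$\begin{cases} x_j^T (y - X\hat\beta) = \lambda\, \mathrm{sign}(\hat\beta_j), & \text{if } j \in A, \\ |x_j^T (y - X\hat\beta)| \le \lambda, & \text{if } j \notin A. \end{cases}$$

According to Lemma 10, we assume both (49) and (50) hold. This happens with probability $1 - c_1 \exp(-c_1 \lambda^2)$. For any $j$,

$$\begin{aligned}
x_j^T (y - X\hat\beta) &= x_j^T (X\beta^0 - X\hat\beta + \epsilon) \\
&= x_j^T X (\beta^0 - \hat\beta) + x_j^T \epsilon \\
&\ge x_j^T X (\beta^0 - \hat\beta) - \frac{\lambda}{2}.
\end{aligned}$$

Thus for $j \in A$,

$$|x_j^T X (\hat\beta - \beta^0)| \ge \frac{\lambda}{2}.$$

Hence

$$\begin{aligned}
\|X^T X (\hat\beta - \beta^0)\|_2^2 &= \sum_{j \in A} \big(x_j^T X (\hat\beta - \beta^0)\big)^2 + \sum_{j \notin A} \big(x_j^T X (\hat\beta - \beta^0)\big)^2 \\
&\ge \sum_{j \in A} \big(x_j^T X (\hat\beta - \beta^0)\big)^2 \\
&\ge \frac{\lambda^2}{4}\, |\mathrm{supp}(\hat\beta)|.
\end{aligned}$$

Also,

$$\|X^T X (\hat\beta - \beta^0)\|_2^2 \le \|X^T X\|_2^2\, \|\hat\beta - \beta^0\|_2^2 \le \phi_{\max}^2 \cdot \frac{4k\lambda^2}{m^2}.$$

Combining the two inequalities, we have that

$$|\mathrm{supp}(\hat\beta)| \le \frac{16 k \phi_{\max}^2}{m^2}$$

holds with probability $1 - c_1 \exp(-c_1 \lambda^2)$.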


□ 


C. 2. Proof of Lemma 2 

Proof. Note = {X^Xe)~^X^. According to the assumption assumed in 
Lemma 2, (/imin > t'- Thus for any possible active set E, 

max I [{X]^Xe)~^] \ < l/z/^. 

The above result can be easily obtained using Singular Value Decomposition on 
Xe- Therefore, we have 

max|A^|y < max |Ay I 

i,j i,3 


□ 


C. 3. Proof of Lemma 3 


Proof. Without loss of generality, we assume $\beta^0 = 0$. We first see that for any fixed $E$, and $\eta_n$ chosen as in (11), the upper and lower bounds simplify to


Us = oo, Ls 


max 

kGE,k^j 


A ■ r[l + {\E\ - l)p] - Pk,E 
P 


+ Pj,E, 


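where $\hat\beta_E \in \mathbb{R}^{|E|}$ is the least squares estimator with the $E$ variables,

$$\hat\beta_E = (X_E^T X_E)^{-1} X_E^T y.$$

Note that the first two equations of (15) are automatically satisfied in this case. Without loss of generality, we assume $L_{\mathcal{E}^*} > 0$, and noticing $L_{\mathcal{E}^*} \le \max_{\mathcal{E}} L_{\mathcal{E}}$, we have

$$P\big(L_{\mathcal{E}^*}(y_n) > 1/\delta_n\big) \le P\big(\max_{\mathcal{E}} L_{\mathcal{E}}(y_n) > 1/\delta_n\big).$$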


Since max^ Ls(jjn) is the maximum of at most sub-Gaussian variables, thus 
the RHS is bounded by 0(e“^") = 0(p“''), which goes to 0. □ 


C.4. Proof of Lemma 4

Proof. Per Lemma 1, we have

$$|E_n| \le cK,$$

with probability at least $1 - c_1 \exp(-c_1 \lambda^2)$.

We modify the selection procedure $\mathcal{E}^*$ on the small probability event. More specifically, we define $\tilde{\mathcal{E}}^*$ as

$$\tilde{\mathcal{E}}^*_n(y_n, X_n) = \begin{cases} \mathcal{E}^*(y_n, X_n), & \text{if } |E_n| \le cK, \\ \text{no selection}, & \text{else.} \end{cases}$$

It is easy to see that $\tilde{\mathcal{E}}^*$ is also an affine selection procedure, which differs from $\mathcal{E}^*$ only on the event $\{|E_n| > cK\}$. Thus the pivots formed with $\mathcal{E}^*$ and $\tilde{\mathcal{E}}^*$ converge in probability,


P J/ji; rj^ YlnVm Vn Mm Vs* j ) 

Vn: Mn ^nVm Vn Mm p£* j Pg* )] 


pCi 


Therefore, we only need to consider the asymptotic distribution of the pivot with $\tilde{\mathcal{E}}^*$ as the selection procedure. Note that for $\tilde{\mathcal{E}}^*$,

~ cK ~ 

XI{£*,'nn) <-maxlXijl, r{£*)<p, \Sn\ < ■ 

Now with our choice of S^s, it is easy to rewrite the condition in Theorem 3 as 

n-P\\ogPny+^^ ^ 0 . 


□ 




